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Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever 
more computationally demanding as temporal and spatial resolutions improve, and 
large, publicly available data sets proliferate. Moreover, methodological improvements 
in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric 
permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically 
increase the computational burden. Despite these challenges, there do not yet exist any 
fMRI software packages which leverage inexpensive and powerful graphics processing 
units (GPUs) to perform these analyses. Here, we therefore present BROCCOLI, a 
free software package written in OpenCL (Open Computing Language) that can be 
used for parallel analysis of fMRI data on a large variety of hardware configurations. 
BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and 
an AMD GPU. These tests show that parallel processing of fMRI data can lead 
to significantly faster analysis pipelines. This speedup can be achieved on relatively 
standard hardware, but further, dramatic speed improvements require only a modest 
investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear 
spatial normalization to a 1 mm-^ brain template in 4-6 s, and run a second level 
permutation test with 10,000 permutations in about a minute. These non-parametric 
tests are generally more robust than their parametric counterparts, and can also enable 
more sophisticated analyses by estimating complicated null distributions. Additionally, 
BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. 
The new software is freely available under GNU GPL3 and can be downloaded from github 
(https://github.com/wanderine/BROCCOLI/). 

Keywords: Neuroimaging, fIVIRI, Spatial normalization, GPU, CUDA, OpenCL, Image registration. Permutation test 



1. INTRODUCTION 

Functional magnetic resonance imaging (fMRI) has become the 
de facto standard methodology in contemporary efforts to image 
the functioning of the human brain in both health and dis- 
ease. Nonetheless, fMRI-based research arguably lags behind in 
its adoption of recent advances in computer hardware, despite 
several recent trends that have underlined the need for greater 
computational resources. First, the temporal and the spatial reso- 
lution of fMRI data continues to improve with stronger magnetic 
fields and more advanced scanning protocols (Moeller et al., 
2010; Feinberg and Yacoub, 2012), leading to the production 
of significantly larger datasets. Second, fMRI studies are trend- 
ing toward larger numbers of subjects to increase their statistical 
power (Ekiund et al., 2012a; Thyreau et al., 2012; Button et al., 
2013) sometimes aided by a proliferation of data sharing initia- 
tives (Biswal et al., 2010; Poldrack et al, 2013) ^'^ that provide 
open access to large amounts of data. The human connectome 

^ http://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html 
^https://openfmri.org/ 



project (van Essen et al., 2013) ^ for example, shares high 
resolution data from a large number of subjects (the goal is 1200), 
and a single resting state scan results in a dataset of the size 
104 X 90 X 72 X 1200. Third, non-parametric methods based on 
permutation and Bayesian Markov Chain Monte Carlo (MCMC) 
methods are more frequently being used to improve neuroimag- 
ing statistics (da Silva, 2011; Ekiund et al., 2012a, 2013b), but 
suffer from long processing times compared to conventional para- 
metric methods. Some progress toward parallelization has been 
made in each of the three major packages commonly used in 
fMRI-based research (SPM, FSL, and AFNI). For example, AFNI 
has direct support for running some functions in parallel on 
several CPU cores, using the open multi-processing (OpenMP) 
library; FSL can take advantage of several computers or CPU 
cores, by installing packages like Condor or GridEngine, and 
has recently added graphics processing unit (GPU) support for 
MCMC based diffusion tensor analysis (Hernandez et al., 2013); 
and Huang et al. (2011) recently proposed to accelerate image 



^http://www.humanconnectome.org/ 
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registration in SPM by using a GPU. Moreover, a number of 
prominent projects are underway to enable big data approaches 
to functional neuroimaging at large supercomputing centers 
[e.g., (Lavoie-Courchesne et al., 2012)]. At this stage, however, 
these approaches still require a significant investment of time and 
effort by expert technical staff and thus remain inaccessible to the 
majority of investigators. Thus, despite efforts by existing anal- 
ysis packages, we feel that the community could benefit from 
a more comprehensive focus on parallel computation. Further, 
being relatively new, GPUs offer some unique challenges as well 
as promising potential benefits. 

Since the introduction of the CUDA programming language 
in 2007, general purpose computing on graphics processing units 
(GPGPU) (Owens et al, 2007) has gained prominence in a wide 
range of scientific fields, including medical imaging (Shams et al, 
2010; Pratx and Xing, 2011; Eklund et al, 2013a) and neuro- 
science (Jeong et al., 2010; Pezoa et al., 2012; Ben-Shalom et al., 
2013; Hoang et al., 2013; Yamazaki and Igarashi, 2013). The main 
reasons are that GPUs are inexpensive, power efficient and able 
to run several thousand threads in parallel, commonly provid- 
ing a performance boost of 1-2 orders of magnitude for a small 
investment (see Table 1). Nonetheless, GPGPU is still uncommon 
in the neuroimaging field, where medical imaging and neuro- 
science intersect. Here, we therefore present BROCCOLI, a free 
software for parallel analysis of fMRI data on many-core CPUs 
and GPUs. BROCCOLI contains a large number of additions 
and improvements over our previous work (Eklund et al, 2010, 
2011a; Forsberg et al, 2011; Eklund et al, 2012b). Some exam- 
ples are Bayesian fMRI analysis using MCMC, first level statistical 
analysis using the Cochrane-Orcutt procedure (Cochrane and 
Orcutt, 1949), linear and non-linear registration for an arbi- 
trary number of scales and support for _F-tests as well as a 
larger number of regressors. While our previous implementa- 
tions used CUDA, the most popular programming language for 
GPGPU, BROCCOLI is instead written in the open computing 
language (OpenCL) [see e.g., Munshi et al. (2011)]. This makes it 
possible to run BROCCOLI on many types of hardware, includ- 
ing CPUs, Nvidia GPUs, AMD GPUs, field programmable gate 
arrays (FPGAs), digital signal processors (DSPs) and other accel- 
erators (e.g., the Intel Xeon Phi). As neuroimaging researchers 
use a wide range of operating systems (Hanke and Halchenko, 
2011), it is also important that BROCCOLI can run efficiently 
regardless of the platform. One way to achieve this is to develop 
BROCCOLI for a specific platform (e.g., Windows), and then 
simply run BROCCOLI through a virtual machine for other 



platforms (e.g., Linux). However, direct access to GPU hard- 
ware through a virtual machine can currently be problematic, 
and was therefore not an option for our software. Instead, we 
have developed BROCCOLI using a combination of the platform- 
independent languages OpenCL and C-|~|-, and have made the 
source code freely available so that it can be compiled on any 
desired operating system supporting these widely deployed stan- 
dards. In addition, as an added convenience, we have provided 
pre-compUed libraries for the Linux and Windows operating sys- 
tems that can be linked to projects developed on either platform. 
A wrapper for Matlab is currently available, a Python wrapper is 
being developed and future plans include wrappers for bash and 
R. In addition to the improvements described above, BROCCOLI 
has also been extensively tested and compared to SPM, FSL, and 
AFNI by using a large number of freely available fMRI datasets. 
BROCCOLI is available as free software under GNU GPL3 and 
can be downloaded from github*. 

2. METHODS AND IMPLEMENTATION 

The typical analysis pipeline for fMRI data is compromised of 
image registration, image segmentation, slice timing correction, 
smoothing, and statistical analyses. The methods used for these 
different processing steps in BROCCOLI are described in this 
section, and implementation details are given at the end of the 
section. 

2.1. IMAGE REGISTRATION 

Image registration for fMRI is used to align an anatomical Tl 
volume to a brain template (e.g., MNI or Talairach), to align 
an fMRI volume to the anatomical Tl volume, and to per- 
form motion correction. The registration between the anatomical 
space and a standard brain space, often called spatial normal- 
ization, can be performed using a linear transformation model 
(e.g., affine or rigid) or by using a non-linear approach, which 
is much more computationally demanding. In a comparison of 
non-linear deformation algorithms for human brain MRI reg- 
istration (Klein et al, 2009), the DARTEL algorithm in SPM 
took an average of 71min to register a single Tl volume to 
the MNI template (Imm^ resolution) and the FNIRT algo- 
rithm in FSL used an average of 29 min. The AFNI software 
did not until recently have support for non-linear registration, 
but can now be achieved through the function 3dQwarp. Based 
on our benchmarking, non-linear registration with 3dQwarp 



''https://github.com/wanderine/BROCCOLI/ 



Table 1 | Hardware configuration and performance measures of the computer used for testing the different software packages. 

Device Processor cores Memory Single precision Double precision Memory bandwidth Price 

(GB) (GFLOPS) (GFLOPS) (GB/s) (USD) 



Intel Core i7-3770K 4 (8 with hyper threading) 16 1 core: 56, 4 cores: 224 1 core: 28, 4 cores: 112 26 330 

Nvidia GTX 680 1536 4 3090 129 192 500 

AMD Radeon 7970 2048 3 3790 947 264 500 

A Linux operating system was used (CentOS 6.4 64 bit) witti an OCZ 128 GB SSD fiard drive. Tfie tfieoretical performance for single (32 bit floats) and double (64 bit 
floats) precision is given as giga floating point operations per second (GFLOPS). Prices are from newegg.com and should be seen as approximate. 
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takes about 36 min with a single-threaded version of AFNI, and 
13min using the multi-threaded OpenMP version (for a CPU 
running 8 threads). Thus, depending on the algorithm, nor- 
malization for a study involving 30 subjects can take 5-35.5 h. 
Moreover, to obtain satisfactory results, it may be necessary to 
run the registration algorithm with a number of different set- 
tings. For these reasons, affine registration to a standard brain 
space is sometimes performed instead of a non-linear one, even 
though the non-linear approach can yield a better registration. 
Another time saving approach is to perform spatial normal- 
ization to a brain template of lower resolution, e.g., 2mm^ 
voxels, but this solution is less appealing, since spatial resolu- 
tion is sacrificed. Due to the computational challenges of image 
registration, GPU acceleration of such algorithms is very pop- 
ular with some 60 publications since 1998 (Shams et al, 2010; 
Fluck et al, 2011; Pratx and Xing, 2011; Ekiund et al, 2013a). 
GPUs can thus easUy be used in the neuroimaging field, to for 
example enable more widespread use of demanding non-linear 
registration algorithms. 

2. 1. 1. Linear image registration 

BROCCOLI uses a single registration algorithm to perform 
the three described registrations (Tl-to-MNI,fMRI-to-Tl, and 
motion correction). Here we summarize the algorithm, which has 
been previously described (Ekiund et al, 2010). The main idea 
of the algorithm is to use the optical flow equation (Horn and 
Schunck, 1981) 

A7, (1) 



where V7 is the gradient of the volume, v is a motion vector that 
describes the difference between the volumes and Al is the inten- 
sity difference between the two volumes. The aperture problem, 
however, prevents us from solving this equation directly, as there 
are three unknown variables (the motion in x, y, and z), but only 
one equation. Instead of solving the equation for each voxel sep- 
arately, one can minimize the expression over the entire volume. 
The total squared error can be written as 



^(V7(x,)^v(x,)- A7(x,))' 



(2) 



where x, denotes the position of voxel ;'. A linear model of 
the motion field can be used to represent a motion vector in 
each voxel. The motion field v(x) for affine transformations in 
3D can be modelled with a 12-dimensional parameter vector, 
P = [pi,p2,p3,~p4,p5,p6~p7,p&,p9,pio,pii,pi2V, and a base 
matrix B{x) according to (Hemmendorff et al., 2002) 



(3) 
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The first three parameters are the translations and the last nine 
parameters form a transformation matrix (if an identity matrix 
is added, as the parameter vector p used here only describes the 
difference between the two volumes). The variables x, y, and z are 
the coordinates of voxel x. By using the model of the motion field, 
v(x) = B(x) p , the error measure can be written as 



e^ = J2 {'^lix.fBix,) p - A/(x,))' . 



(4) 



The derivative of this expression, with respect to the parameter 
vector, is given by 



9e2 
dp 



(5) 



and setting the derivative to zero yields the following linear 
equation system 



J2bJvi,viJb,p = ^BfV/,A/„ 



(6) 



where A is a matrix of size 12x12 and /i is a vector of size 12x1. 
The best parameter vector can finally be calculated as 



p = A^h. 



(7) 



The system of linear equations is easy to solve, while the com- 
putationally demanding part is to sum over aU voxels. L2 norm 
minimization makes it possible to calculate the parameters that 
give the best solution. The solution can then be improved by iter- 
ating the algorithm and accumulating the parameter vector (to 
avoid repeated interpolation). The most common approach is 
otherwise to maximize a similarity measure by searching for the 
best parameters, using some optimization algorithm. To handle 
large differences between two volumes, it is common to start the 
registration on a coarse scale and then improve the registration by 
moving to finer scales. BROCCOLI uses three to four scales for the 
registration between Tl and MNI and between fMRI and Tl; the 
difference between each scale is a factor two in each dimension. 

The estimated affine transformation parameters can be 
restricted to a rigid transformation (i.e., translations and rota- 
tions only), and is accomplished in BROCCOLI by applying a sin- 
gular value decomposition (SVD) to the transformation matrix 
and then forcing the singular values to be one. Rigid registration 
is used for fMRI-Tl registration and for motion correction, while 
affine registration (12 parameters) is used for the Tl-MNI regis- 
tration. For the motion correction procedure, the rotation angles 
01,62, 6*3 are extracted from the estimated rotation matrix for 
each time point using the following formulas (Shoemake, 1994; 
Day, 2012) 
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Oi = atanlipi), pu), 

C2 = ^P4 ■ P4+P5 -pS, 

6*2 = atan2{-p6, ci), 
Si = sin{6i), 
Cl = cos(9i), 

6*3 = atan2{si ■ pw - ci ■ pj, ci ■ p^ - si ■ pu), 

where atan2(a, b) is the four quadrant arctangent of a and b. The 
main reasons for extracting the rotation angles are to use them as 
nuisance regressors in the statistical analysis and to present them 
to the user. 

2. 1.2. Non-intensity based image registration 

The registration algorithm used in BROCCOLI is not based on 
the image intensity directly, e.g., the image gradient as described 
above. Instead, the algorithm is based on matching edges to 
edges and lines to lines, by using the concept of local phase 
from quadrature filter responses (Granlund and Knutsson, 1995; 
Knutsson and Andersson, 2003). A quadrature filter is complex 
valued in the spatial domain; the real part is a line detector and the 
imaginary part is an edge detector. The local phase is the relation- 
ship between the real and imaginary filter responses and describes 
the type of local structure (e.g., a line or an edge), while the mag- 
nitude can be seen as a certainty measure of how likely it is that the 
filter detected a structure. The local phase concept is illustrated 
in Figure 1. The quadrature filters need to be created using filter 
optimization techniques, which simultaneously consider proper- 
ties in the spatial domain and the frequency domain (Granlund 
and Knutsson, 1995; Knutsson et al., 1999). In the presented equa- 
tions, the image gradient V7 is replaced with a phase gradient V(p 



and the image difference A/ is replaced with a phase difference 
A(p. The phase difference can be calculated as 

A(p = arg [qi ■ <j|) , (9) 

where qi and <j2 are the complex valued quadrature filter 
responses for the two volumes and * denotes complex conjuga- 
tion. A nice property of the local phase is that it is invariant to 
the image intensity (all edges are for example interpreted equally, 
regardless if the image intensity changes from 0 to 1 or from 10 
to 11), making it easier to register volumes from different modal- 
ities or volumes with different or varying contrast. Phase based 
optical flow was introduced in the field of computer vision (Fleet 
and Jepson, 1990) and eventually propagated to the medical imag- 
ing domain (Hemmendorff et al., 2002; Knutsson and Andersson, 
2005; Mellor and Brady, 2005). While phase based image reg- 
istration can in some cases be more robust against intensity 
differences (Hemmendorff et al., 2002; Mellor and Brady, 2005; 
Eklund et al., 20 1 lb), a drawback is that it requires filtering with a 
number of (non-separable) filters in each iteration, which is com- 
putationally demanding. Fortunately, CPUs are perfectly suited 
for parallel operations like filtering (Eklund and Dufort, 2014). 

2.1.3. Non-linear image registration 

As previously mentioned, non-linear methods can lead to 
a significantly better registration between a subject specific 
anatomical volume and a brain template. BROCCOLI uses the 
Morphon (Knutsson and Andersson, 2005; Forsberg et al, 2011; 
Forsberg, 2013) to perform non-linear registration. The Morphon 
is also based on phase based optical flow, and the two most impor- 
tant parts of the Morphon are, therefore, the same as for the linear 
registration algorithm; to apply a number of quadrature filters 
and to calculate phase differences. The main differences are that 
the linear algorithm uses three quadrature filters (oriented along 
x, y, and z) and solves one equation system for the entire volume, 
while the Morphon uses six quadrature filters (evenly distributed 
on the half sphere of an icosahedron) and solves as many equa- 
tion systems as there are voxels. The error being minimized in 
each voxel can be written as 

N , s 2 

e' = ^(cjtT(A^A-d)) , (10) 
k=\ ^ ' 

where Ai^j- is the phase difference between the two volumes for 
quadrature filter k, Ck is a certainty estimate for filter k, is the 
orientation vector for filter k, N is the number of quadrature fil- 
ters, d is the displacement vector to be optimized and T is a local 
structure tensor (Knutsson, 1989; Granlund and Knutsson, 1995; 
Knutsson et al., 201 1). A local structure tensor in image process- 
ing is analogous to a diffusion tensor in diffusion tensor imaging 
(DTI); it represents the magnitude and orientation of the sig- 
nal in each neighborhood. The tensor can be calculated from the 
six complex valued quadrature filter responses as (Granlund and 
Knutsson, 1995) 



Dark to bright edge 




Bright to darl< edge 

FIGURE 1 I This figure presents the main concept of local phase <p from 
quadrature filter responses. A quadrature filter is complex valued in the 
spatial domain; the real part is a line detector and the imaginary part is an 
edge detector. If the filter response only contains a real valued component, 
it means that the filter detected a line. If the filter response only contains 
an imaginary valued component, it means that the filter detected an edge. 
It is important to combine the local phase with the magnitude of the 
complex valued filter response, as the local phase does not have any 
meaning for a low magnitude. 
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where I is an identity tensor. The purpose of using the tensor in 
the error measure is to reinforce displacement estimates along the 
local predominant orientations (i.e., displacements perpendicular 
to edges and lines). Using an L2-norm, the best displacement vec- 
tor can be calculated for each voxel directly, by once again solving 
a linear system (of size 3x3), i.e.. 



k=l 



-1 N 



4AcpkT^Thk. 



(12) 



The estimated displacement field is regularized by applying 
Gaussian smoothing separately to each motion component 
(x, y, z) before it is used to warp the Tl volume. Just as for the 
linear registration, the displacement field is accumulated in each 
iteration to avoid repeated interpolation. An affine registration 
(12 parameters) is first estimated between the Tl volume and the 
MNI template before estimation of the non-linear displacement 
field. 

2.2. IMAGE SEGMENTATION 

SPM has several functions for segmenting brain volumes. FSL 
provides BET (brain extraction tool) and FAST (FMRIB's auto- 
mated segmentation tool) while AFNI provides the function 
3dSkullStrip. BROCCOLI performs skuUstripping by first regis- 
tering the Tl volume to MNI space, using an MNI template with 
skuU, then applies an inverse transform to the MNI brain mask 
and finally performs a multiplication between the transformed 
mask and the original Tl volume to obtain a skuUstripped version 
of the Tl volume. The skuUstripped Tl volume is then aligned to 
an MNI template without skull, to improve the alignment, and 
the MNI brain mask is again inversely transformed (using the new 
registration parameters) and multiplied with the original Tl vol- 
ume, to obtain a better skuUstrip. The fMRI data is segmented by 
first applying 4 mm 3D Gaussian smoothing to one of the fMRI 
volumes and then using a threshold that is 90% of the mean value. 

2.3. SLICE TIMING CORRECTION 

Slice timing correction is normally applied to fMRI data (Sladky 
et al., 2011), as the slices in each volume are collected at slightly 
different time points. BROCCOLI sets the middle slice as the ref- 
erence and then applies cubic interpolation in time to correct for 
the temporal difference between the slices. 

2.4. SMOOTHING 

fMRI data is frequently spatially smoothed. The non-linear reg- 
istration algorithm also uses Gaussian smoothing, for example 
to reguralize the tensor components and the resulting displace- 
ment field in each iteration. BROCCOLI utilizes a simple form 
of normalized convolution (Knutsson and Westin, 1993), called 
normalized averaging, to avoid problems with voxels close to the 
edge of the brain being influenced by voxels outside the brain. The 
normalized filter response nfr is calculated as 



nfr 



(v • c) * / 
c*f 



(13) 



where / is the filter, v is one fMRI volume, c is a certainty measure, 
* denotes convolution and • denotes pointwise multiplication. 



The certainty is simply the fMRI brain mask, such that the cer- 
tainty is one inside the brain and zero outside. If a gray matter 
segmentation is available, the same approach can be used to pre- 
vent similar problems with smoothing that includes values from 
other types of brain matter (by setting the certainty to one for 
gray voxels and zero for aU other voxels). 

2.5. STATISTICAL ANALYSIS 

The statistical analysis is the core of all fMRI software packages. 
The use of CPUs for statistical computations is a relatively new 
concept (Suchard et al., 2010; Guo, 2012) and can for exam- 
ple be used to speedup demanding Markov Chain Monte Carlo 
(MCMC) simulations (Lee et al, 2010). We believe that CPUs 
(or at least the computational capacity they confer) are a neces- 
sary component for incorporation of developments in the field 
of statistics to the field of neuroimaging, especially for high reso- 
lution fMRI data (Feinberg and Yacoub, 2012). By using GPUs, 
computationally demanding non-parametric tests can be used 
instead of parametric ones (Nichols and Holmes, 2002; Ekiund 
et al, 2011a) and MCMC based methods [e.g., (Woolrich et al., 
2004)] also become feasible (da Silva, 201 1). 

The SPM, FSL, and AFNI software packages are all mainly 
based on the general linear model (GLM) for first (subject) and 
second level (group) analyses, as proposed by Friston et al. (1994). 
The GLM can be written in matrix form as 



(14) 



where y are the observations for one voxel, are the parame- 
ters to estimate, X is the design matrix (model) containing all the 
regressors and e are the errors that cannot be explained by the 
model. As the GLM is applied to each voxel independently, it is 
perfectly suited for parallel implementations. By minimizing the 
squared error | |c | p, it can be shown that the best parameters (for 
independent errors) are given by 



p = (x^xj 



x'y. 



(15) 



A useful property of this expression is that the term (-X'-'^X) ^ 
is the same for aU voxels and can, thus, be precalculated. A t-test 
value can easily be calculated from the estimated weights as 



J$ - u 



var(€) cT(xTx)"^ 



(16) 



where c is a contrast vector, e is the residual of the GLM and u is 
a scalar for the null hypothesis P = u. An f -test value can in a 
similar manner be calculated as 



(cp -uV fvar (e) C (X^X) ^ cA ^ (cfi - u) 

F=S Z_V 1 V L (17) 

N 



where C is a contrast matrix and N is the number of contrasts. 
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2.5. 1. First level analysis 

The first level fMRI analysis starts with slice timing correction 
and motion correction. The estimated motion parameters (trans- 
lations and rotations) are included in BROCCOLI by default 
as additional regressors in the GLM design matrrs, to further 
reduce effects of head motion (Johnstone et al, 2006). Gaussian 
smoothing is applied to each fMRI volume and the GLM is finally 
applied to the smoothed volumes. In addition to motion regres- 
sors and regressors for the experimental design, the design matrix 
in BROCCOLI also contains regressors to remove the mean and 
trends that are linear, quadratic or cubic. The effect of using these 
additional regressors is similar to a highpass filtering. The GLM 
errors are for first level fMRI analysis often modelled as an auto 
regressive (AR) process, 

P 

6t = ^P,6t-, ■ + Wt, (18) 
j = 1 

where p is the order of the AR process, p, are the AR parame- 
ters and w is white noise with variance cr^. A Cochrane-Orcutt 
procedure (Cochrane and Orcutt, 1949) is used in BROCCOLI 
to estimate the beta weights for autocorrelated errors. The GLM 
weights P are first estimated using ordinary least squares (equa- 
tion 15) and then a voxel- wise AR model of the fourth order 
is used to model the residuals (Worsley et al., 2002). The AR 
parameters are estimated by solving the Yule-Walker equations 
independently for each voxel. Each volume of AR estimates is spa- 
tially smoothed with a 7 mm Gaussian filter to further improve 
the estimates (Woolrich et al, 2001; Worsley et al., 2002; Gautama 
and HuUe, 2004), before the actual whitening is applied to the 
smoothed fMRI data and the regressors in the design matrix (such 
that each voxel gets its own specific design matrix). The compo- 
nents of the whitened data y and the whitened regressors X are 
thus calculated as 

4 

yt = yt -"^piyt-i, (19) 
>= 1 

4 

^r,r=X,,,-^pA_,, „ (20) 
1= 1 

where p,- are the spatially smoothed AR estimates, r denotes 
regressor and t denotes time point. The whitened data y and the 
whitened regressors X are then used to estimate new beta weights, 
according to 

j8 = (Fx)"'Fy. (21) 

As a last step, the AR parameters are re-estimated using residuals 
calculated with the new weights fi, the original data y and the 
original regressors X. The Cochrane-Orcutt procedure is repeated 
three times to obtain good estimates of the GLM weights and the 
AR parameters. Finally, the statistical maps are calculated using 
the variance of the uncorrected residuals 6, obtained as 

(=y-Xp. (22) 



FSL uses a similar iterative approach to estimate a voxel-wise 
prewhitening matrix (Woolrich et al., 2001), with the exception 
that the spatial smoothing is done separately for different tis- 
sue types. The voxel-specific noise model used in BROCCOLI 
has been shown to yield more valid results than those obtained 
from SPM (Eklund et al, 2012a), which uses a global AR(1) 
model. After the first level statistical analysis, the results (e.g., beta 
weights) are transformed to MNI space, by combining the esti- 
mated registration parameters for Tl-to-MNI and fMRI-to-Tl 
transformations and the estimated displacement field from the 
non-linear registration. 

2.5.2. Second level analysis 

The second level analysis in fMRI is straightforward compared 
to the first level analysis, once all the first level results are in 
a common brain space. A group-wise f-test or _F-test can eas- 
ily be performed by using the same functions as for the first 
level GLM. BROCCOLI currently only supports conventional 
f-tests and _F-tests for second level analysis, but we plan to also 
include other types of analyses (e.g., where the variance of the 
beta estimates are used as weights) in future releases. 

2.5.3. Frequentist inference 

In contrast to other software packages for fMRI analysis, 
BROCCOLI is not based on parametric statistics. All p-values 
are instead calculated through non-parametric permutation 
tests (Dwass, 1957; Nichols and Holmes, 2002), both for first level 
and second level analyses. The main motivation is that paramet- 
ric statistics require several assumptions to be met for the results 
to be valid. In fMRI it is also necessary to correct for a large 
number of tests, due to the high spatial resolution. The multiple 
testing makes the parametric assumptions much more critical, as 
one has to move far along the tail of the null distribution. The 
SPM software relies on Gaussian random field theory (GRFT) 
to correct for the multiple testing (Worsley et al., 1992), while 
FSL mainly works with GRFT and non-parametric permutation 
tests (for group analyses only). AFNI instead uses the false discov- 
ery rate (FDR) (Genovese et al., 2002) and a cluster simulation 
tool. A permutation test solves the problem of multiple testing 
in a very simple way. In each permutation, only the largest value 
of the statistical map (e.g., the maximum f-test value, the max- 
imum f-test value, the size or mass of the largest cluster etc.) is 
saved to form the nuU distribution of the maximum test statistics. 
Corrected p-values are finally calculated as the proportion of val- 
ues in the estimated null distribution that are larger than or equal 
to the test value for the current voxel or cluster. A threshold for a 
certain significance level a, corrected for multiple testing, can be 
calculated by first sorting the estimated nuU distribution values, 
and then simply using the value that is larger than (100 — a) % 
of the values. The main problem is that a large number of per- 
mutations, normally 1000-10,000, are required to obtain a good 
estimate of the null distribution. Since a full statistical analysis 
needs to be performed in each permutation, the total process- 
ing time can be several hours or days for a single test, using 
conventional multi-core CPU implementations. This is the main 
reason why permutation tests are not standard procedure in the 
neuroimaging field. 
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For first level analysis in BROCCOLI, detrending and whiten- 
ing [using a voxel-wise AR(4) model as previously described] 
is applied to the motion corrected data and new fMRI data is 
then generated in each permutation by using an inverse whitening 
transform with randomly permuted whitened data. The smooth- 
ing has to be applied in each permutation, as the smoothing alters 
the autocorrelation structure of the £MRI data. Permutation test- 
ing for the second level analysis is much easier, as no whitening or 
smoothing is required. See our previous work for further infor- 
mation on the non-parametric analysis (Ekiund et al., 2011a, 
2012a). 

2.5.4. Bayesian inference 

The GLM model previously described can alternatively be ana- 
lyzed using Bayesian methods. A Bayesian analysis begins with a 
prior distribution p(fi, , p) over the model parameters and 
subsequently updates the prior with the observed data. The result 
is the posterior distribution p(j8, a^, p\X, y), which encapsu- 
lates all information about the unknown parameters conditional 
on the observed data. In fMRI, the brain activity can be visu- 
alized as a heat map of Pr(/J, > 0|X, y), commonly known as 
a posterior probability map (PPM) (Friston et al, 2002). The 
joint posterior p(/3, , p\X, y) is often not tractable in ana- 
lytical form, but can be approximated by different approaches. 
The most common approach in the fMRI field is to use approx- 
imation techniques like variational Bayes, where the posterior is 
factorized into several independent factors to obtain an analyti- 
cal expression (Penny et al., 2003). A less common approach is 
to use techniques based on Markov Chain Monte Carlo (MCMC) 
simulation. MCMC produces a sample from the posterior, and 
the probability of activity Pr(/J, > 0|X, y) can be approximated 
by the proportion of simulated /S; being larger than zero. The 
PPM for any contrast is also directly available from the pos- 
terior simulations. Note that since simulations are done using 
the joint posterior, PPMs are not conditional on point esti- 
mates of cr^ and p, leading to more accurate inferences regarding 
brain acitivity. 

BROCCOLI uses a specific MCMC algorithm, the Gibbs sam- 
pler, to generate draws from the posterior by iteratively sim- 
ulating from two full conditional posteriors. First, the auto- 
correlation parameters p are updated by simulation from 
p\fi, , y, X as a (multivariate) Gaussian distribution. Second, 
the variance cr^ is updated by simulation from a^lp, y, X 
as an inverse Gamma distribution and the GLM weights fi 
are finally updated by simulation from P\cr^, p, y, X as 
a (multivariate) Gaussian distribution. These conditional dis- 
tributions are obtained when the priors for p\a^ and p are 
Gaussian and the prior on is inverse Gamma. The exact 
details of each updating step can be found in most Bayesian 
textbooks, see e.g., Murphy (2012). Note that each updat- 
ing step conditions on the most recently simulated value 
for the conditioning parameters. While MCMC methods can 
theoretically be used to approximate any posterior, a com- 
mon problem is the significantly longer processing time com- 
pared to techniques like variational Bayes. BROCCOLI runs 
a large number of MCMC chains in parallel to reduce the 
processing time. 



2.6. IMPLEMENTATION 

We will here describe the implementation of BROCCOLI for the 
different algorithms. Readers are referred elsewhere for introduc- 
tions to GPU programming (Kirk and Hwu, 2010; Munshi et al., 
2011; Sanders and Kandrot, 2011). Most of the OpenCL code uses 
single precision to achieve maximum performance, while some 
host code uses double precision (to for example obtain the opti- 
mal affine registration parameter vector). The open source library 
Eigen""' ^ is used in BROCCOLI to perform matrix calculations on 
the host.^ 

2.6. 1. Image registration 

The described linear and non-linear registration algorithms are 
easy to run in parallel. The filtering operation applied in each 
iteration is the most demanding part, especially since quadrature 
filters are non-separable, and has therefore been carefully opti- 
mized. Filtering can be performed as a multiplication in the fre- 
quency domain, after the application of a fast Fourier transform 
(FFT) to the signal and the filter, or as a convolution in the spa- 
tial domain. BROCCOLI uses the convolution approach, for three 
reasons. First, the FFT approach requires an FFT library while 
the convolution approach can rather easily be implemented man- 
ually. The CUDA programming language provides the CUFFT 
library, and a similar OpenCL library called clFFT has recently 
appeared. However, clFFT is in our opinion not yet as mature as 
CUFFT. The user, for example, has to compile the whole project 
to obtain a library file. Second, a convolution approach often pro- 
vides high performance over a wide range of data sizes, while an 
FFT normally performs best for data sizes being a power of 2. 
Third, the convolution approach is less memory demanding as 
the FFT approach requires that the filters are stored as the size of 
the signal for an elementwise multiplication. 

Convolution is easy to run in parallel, and high performance 
can be achieved by taking advantage of the fact that the filter 
responses for neighboring voxels use mainly the same input data. 
An easy way to implement a non-separable 3D convolution is 
to take advantage of the texture memory, as the texture mem- 
ory cache can be used to speedup reads that are spatially local. 
Such an implementation wUl, however, be limited by the global 
memory bandwidth. A better approach is to take advantage of 
the local memory^ available in modern CPUs (CPUs do not nor- 
mally have local memory physically; it can instead be simulated by 
the OpenCL driver). By first reading values from global memory 
into local memory, all the threads in a thread block can repeatedly 
read from the local memory very efficiently. The Nvidia GTX 680 
has 48 KB of local memory per multiprocessor; it can for exam- 
ple store a 3D array of 32 x 32 x 12 float values. The quadrature 
filters used in BROCCOLI contain 7x7x7 coefficients, only 
26 X 26 X 6 = 4,056 filter responses will therefore be valid for 
each multiprocessor. The reason for this is that the convolution 
is undefined along a boundary of (N — l)/2 pixels for an N x 



^http://eigen.tuxfamiIy.org/index.php?title=Main_Page 
^https://bitbucket.org/ eigen/ eigen/ 

'^For readers not familiar with GPU programming, the CPU is often called the 
host while the GPU is called the device. 

^ Local memory in OpenCL is the same thing as shared memory in CUDA. 
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N kernel. The yellow pixels in Figure 2 illustrate the invalid fil- 
ter responses along the image borders for a filter size of 7 x 7. 
To maximize the number of valid filter responses per multipro- 
cessor, a better approach to non-separable 3D convolution is to 
instead perform non-separable 2D convolution on the GPU, and 
then accumulate the filter responses by calling the convolution 
kernel for each slice of the filter [i.e., instead of running all 6 for- 
loops (three for the data and three for the filter) on the GPU, run 
5 on the GPU and 1 on the CPU] . The local memory can for 2D 
be used to store two arrays of 96 x 64 float values, which instead 
gives a total of 10,440 valid filter responses per multiprocessor 
(two blocks of 90 x 58 pixels). The reason for using two arrays 
instead of one, is that each multiprocessor on the Nvidia GTX 
680 can concurrently run 2048 threads, but only 1024 threads per 
thread block. The 1024 threads per block are arranged as 32 along 
the x-direction and 32 along the y-direction, to for example fit 
the number of local memory banks (32). Each thread starts by 
reading 6 values from global memory into local memory (96 x 
64 / 1024 = 6) and then calculates 2 filter responses (giving two 
32 X 32 blocks). Three additional filter responses are then calcu- 
lated by most of the threads, yielding two blocks of 32 x 26 pixels 
and one block of 26 x 32 pixels. Finally, a number of threads are 
used to calculate the final 26 x 26 filter responses. The usage of 
local memory for non-separable 2D convolution is illustrated in 
Figure 2. As several quadrature filters need to be applied to the 
two volumes being registered (3 for linear registration and 6 for 
non-linear registration), 3 filters are applied simultaneosly once 



the data has been loaded into local memory. To achieve maxi- 
mum performance, the for-loops have been unrolled manually 
using a Matlab script. To run a short for-loop on a GPU can 
result in a sub-optimal performance, as it can take a longer time to 
setup the for-loop than to run it (this is especially true for nested 
for-loops). The filters are stored in constant memory, as they are 
used by all threads and since each multiprocessor has a constant 
memory cache. The filter responses are stored in thread specific 
registers. Note that calculating 6 filter responses per thread results 
in a much better ratio of memory operations and calculations, 
compared to a straight forward approach using texture memory 
(where each thread calculates a single filter response). Interested 
readers are referred to our previous work (Eklund and Dufort, 
2014) and our separate github repository' for further details. The 
AMD GPU and the Intel CPU used in our case have only 32 KB 
of local memory, the AMD GPU can also only run 256 threads 
per thread block. The code for these devices instead uses one local 
memory array of 128 x 64 pixels and calculates 120 x 58 filter 
responses in blocks of 16 x 16 pixels. 

The linear registration algorithm involves a summation over 
all voxels to setup an equation system (equation 6). BROCCOLI 
performs this summation using three kernels. The first kernel 
performs all the necessary multiplications and each thread cal- 
culates the sum for one voxel along the x-direction. The number 
of threads per thread block is equal to the width of the volume. 
The second kernel continues the summation along the y-direction 
(the number of threads per block is set to the depth of the volume) 
and the third kernel sums along z. The resulting equation system 
is finally copied to the host, to calculate the best parameter vector. 

Except for the filtering and the summation operation, the 
other required functions are straight forward to implement. 
For the linear registration algorithm, one kernel is used in 
BROCCOLI to calculate phase differences (equation 9) and cer- 
tainties and three kernels are used to calculate phase gradients 
V(p along X, y and z (Eklund et al., 2010). For the non-linear 
registration algorithm, one kernel is used to calculate the ten- 
sor components (equation 11), one kernel is used to setup the 
equation system in each voxel and one kernel solves the equation 
system (equation 12). Both the linear and the non-linear registra- 
tion algorithm use one additional kernel to interpolate from the 
volume being moved to match the template. The texture memory 
is used for these two kernels, as it has hardware support for linear 
interpolation in 1, 2, and 3 dimensions. For all these kernels, each 
thread performs the operations for one voxel. To make sure that 
the same code runs on both Nvidia and AMD CPUs, 256 threads 
per block are used. 

2.6.2. Smoothing 

The smoothing operation is also implemented as a convolution. 
As the Gaussian smoothing filters are Cartesian separable, three 
kernels are used to smooth along x, y, and z. Similarly to the 
non-separable convolution, local memory is used to obtain a 
more efficient implementation. The details of how the separable 
smoothing is performed will therefore not be given here. 



'https://github.com/wanderine/NonSeparableFilteringCUDA 




FIGURE 2 I The grid represents 96 x 64 pixels in local memory (each 
square is one pixel). As 32 x 32 threads are used per thread block, each 
thread needs to read 6 values from global memory into local memory [(96 
X 64)/(32 X 32) = 6], A yellow halo needs to be loaded into local memory 
to be able to calculate all the filter responses. In this case 90 x 58 valid 
filter responses are calculated, making it possible to apply at most a filter of 
size 7x7. The 90 x 58 filter responses are calculated as 6 runs, the first 2 
consisting of 32 x 32 pixels (marked light red and light blue). The 1024 filter 
responses (32 x 32) are calculated in parallel, and the gray squares 
represent three filter responses being calculated. Note that neighboring 
filter responses are calculated using mainly the same pixels. Three 
additional filter responses are calculated in blocks of 32 x 26 or 26 x 32 
pixels (marked green, dark blue and dark red). Finally, a block of 26 x 26 
pixels is processed (marked purple). The halo can easily be changed to 
handle larger filters. 
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2.6.3. Statistical analysis 

The statistical analysis of IMRI data is perfect for parallel process- 
ing; each thread performs the required calculations for one voxel. 
Just as for the registration kernels, all the statistical kernels use 256 
threads per block to fit both Nvidia and AMD GPUs. For first level 
analysis assuming independent errors and for second level anal- 
ysis, the pseudo inverse of the design matrbc [i.e., (-X^X) X''^] 
is calculated on the host and stored in constant memory (as it is 
the same for all voxels). Calculation of the beta weights for one 
voxel can then be simply performed as a number of scalar prod- 
ucts between the rows of the pseudo inverse and the data points 
of the current voxel (see equation 15). The resulting beta weights 
are stored as registers in each thread. However, each GPU thread 
can only handle a limited number of variables, BROCCOLI cur- 
rently therefore supports a maximum of 25 regressors. To simply 
loop over the number of regressors may result in suboptimal per- 
formance, for two reasons. The first reason is that if the index 
to the beta array is not known at compile time, e.g., beta[i], the 
compiler may put beta in global memory instead of registers. The 
second reason is that short for-loops are inefficient on GPUs (as 
mentioned in the filtering implementation). For optimal perfor- 
mance, BROCCOLI instead uses a switch-case approach to first 
determine the number of regressors being used. The code for each 
case is also unrolled, such that all accesses to the beta array are 
known at compile time. To calculate the f-test or f -test value effi- 
ciently in each voxel, some additional values, e.g., {^^^) 
from equation 16, are also pre-calculated and stored in constant 
memory. A limitation of the described approach is that the con- 
stant memory is normally only 32-128 KB; it can thus not store 
arbitrary large design matrices. A potential solution to this prob- 
lem is to instead use texture memory, and take advantage of the 
texture memory cache instead of the constant memory cache. 

The Cochrane-Orcutt procedure is harder to implement, as 
each voxel then uses a specific design matrix (after whitening 
according to equation 20). To calculate a pseudo inverse in each 
thread is problematic, as a design matrix for first level analy- 
sis easily can contain 200 timepoints and 15 regressors. Such an 
operation would thus require at least 3000 floats per thread, far 
outstripping the capabilities of some contemporary devices. For 
example, the Nvidia GTX 680 can handle only 63 floats per thread 
in its registers. Additional floats will spill into slow global mem- 
ory (called local memory in CUDA), which may degrade the 
performance significantly. GPUs that have a LI and/or L2 cache 
may be able to still use a larger number of registers efficiently. A 
possible solution could be to instead use the updating formula 
derived for MCMC (equation 24), but such an approach can also 
require a large number of registers [e.g., 40 registers for the my 
variables for 10 regressors and an AR(1) model]. The current 
solution is to instead calculate all the pseudo inverses on the host 
and then copy them to slow global memory. For these reasons, 
the Cochrane-Orcutt procedure is not yet optimized in terms of 
speed. Permutation testing for first level analysis therefore cur- 
rently uses the simpler approach assuming independent errors. 
The permutation based p-values wiU still be valid, as the same 
analysis is applied in each permutation (whitening is applied prior 
to the permutations, and the autocorrelation is then put back in 
each permutation). 



The whitening operation that is applied prior to the single 
subject permutations, and in the Cochrane-Orcutt procedure, 
requires that an AR model is estimated for each voxel. To accom- 
plish this, each thread loops over time and sets up the Yule-Walker 
system of equations for one voxel. The AR(4) parameters are 
then calculated by directly solving these equations using a matrbc 
inverse. One limitation of this approach is that more advanced 
AR models [e.g., an AR(8) model] requires a larger number of 
registers, both to store the parameters and to calculate the matrbc 
inverse. For the inverse whitening applied in each permutation, to 
generate new data, all the threads also loop over time to generate 
new time series. 

Permutation tests involving cluster based inference require that 
a clustering operation is performed in each permutation, to cal- 
culate the extent or mass of the largest cluster. BROCCOLI uses 
the parallel label equivalence algorithm proposed by Hawick et al. 
(2010) for this purpose. The algorithm is implemented as five 
kernels. The first kernel assigns an unique starting label to each 
voxel that survives the initial voxel-wise threshold (e.g., p = 0.01, 
uncorrected for multiple comparisons). In the second kernel each 
voxel checks its 26 neighbors to see if there is a label with a lower 
value. If a lower label is found, the label of the center voxel is 
updated and an update flag is set to 1. The third kernel resolves 
label equivalences, in order to minimize the number of times the 
second kernel has to be launched [see Hawick et al. (2010) for 
details]. The second and third kernels are launched repeatedly, 
until the update flag is no longer set to 1 . To calculate the size of 
each cluster, a fourth kernel is applied where each thread atom- 
ically increments a cluster specific counter (determined by the 
cluster label). Finally, a fifth kernel is used to obtain the size of 
the largest cluster; the implementation relies on the atomic max 
operation. 

The Bayesian MCMC algorithm can with careful memory 
management lead to a substantial time reduction compared to a 
sequential approach. To see the importance of memory manage- 
ment, consider simulating from the fuU conditional posterior of 

and cr^. Conditional on p, this is a standard linear regression 
update on the transformed model 

y = Xfi + e, (23) 

where X and y are obtained by pre-whitening X and y with the 
most recently simulated coefficients in p (as described in equa- 
tions 19 and 20). Since p changes in every /O-update, both X 
and y need to re-computed in each iteration of the Gibbs sam- 
pler. Both X and y are, however, too large to be stored in the 
fastest GPU memory (thread specific registers), and the cost of 

repeatedly accessing data from slower memory can be very large. 

« 7 « 

To solve this problem, BROCCOLI instead updates X X after a 
change in p, according to 

P P 

x'^X=J2J2P'Pj^-J' (24) 

j = Oj = 0 

where we for convenience define po = — 1 , p is the order of the AR 
model and 5y = J2^= i - ■ are data matrices independent 
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of p (Xf is a vector that contains all the regressors for time point 
f, while X is the full design matrrK). Note that Sy = Sj, and that 
all 5y are symmetric. For a first order AR model, the update is 
given by 

x'~X = Soo-2pSoi + p^Sii. (25) 

Note how the data matrices Sy are separated from p in the 
above expressions. The Sy matrices are not voxel-specific and 
can, therefore, be pre-computed and stored in constant mem- 

ory. Analogous formulas are easily derived for X y (with data 
moments my = XifL i ^t-ift-]) and y^y (with data moments 
^y = XifL 1 yr- iTf-;); both of which are needed for the Gibbs 
sampling. The niy and the g,-^ values are voxel-specific, but 
low-dimensional and can therefore be stored in thread specific 
registers. Despite these optimizations, the implementation can 
currently only handle a small number of regressors and an AR(1) 
model of the residuals. The extension to more elaborate models 
is in principle straight forward however, and rapid advancements 
in GPU memory are likely to remove these limitations in the near 
future 

The Bayesian fMRI analysis also requires random number gen- 
eration to estimate the joint posterior distribution. The CURAND 
library can be used for this purpose for the CUDA program- 
ming language, but there exists no similar library for OpenCL. 
Instead, random numbers are first generated in BROCCOLI from 
a uniform distribution, using a voxel/thread specific seed and a 
modulo operation (Langdon, 2009). This is the only part of the 
OpenCL code that currently uses double precision. The seeds 
are generated on the host side, as this operation only needs to 
be performed once. The uniformly distributed numbers are then 
used to generate numbers from a normal distribution, by apply- 
ing the Box-Muller transform (Box and MuUer, 1958). Random 
numbers from an inverse Gamma distribution can finally be 
generated as 



where m is a random number from a normal distribution with 
zero mean and unit variance, A is the shape parameter of the 
Gamma distribution and B is the scale parameter. 

3. RESULTS 

A number of freely available fMRI datasets (Biswal et al., 2010; 
Poldrack et al., 2013) were used to test our software, and to 
compare it to existing software packages. The hardware used for 
testing is specified in Table 1. Specifically, BROCCOLI was used 
with an Intel CPU, an Nvidia GPU and an AMD GPU, to demon- 
strate that the same code can run on different types of hardware. 
The following software packages were compared to BROCCOLI: 
SPM8, FSL 5.0.4 (Smith et al, 2004) (with the package Condor 
installed for parallel processing) and AFNI (Cox, 1996) (with 
OpenMP support for parallel processing). For FSL, the shell vari- 
able FSLPARALLEL was set to "condor" to measure multi-core 
resuhs. For AFNI, the shell variable OMP_NUM_THREADS was 



set to "1" to generate processing times for single-core processing, 
and to "8" for multi-core processing. BROCCOLI running on 
a CPU automatically uses all available processor cores for all 
processing steps. AH testing scripts can be downloaded from 
github To make the comparison reflective of each package's 
standard use, our testing scripts were posted on the mailing lists 
for SPM, FSL, and AFNI and modified according to responses. 

It should be stressed that the different software packages use 
different algorithms, programming languages and libraries. It is 
therefore hard to make a quantitatively meaningful performance 
comparison. For this reason, we also added the processing time 
for BROCCOLI running on a single CPU core, such that there 
is a baseline comparison for each algorithm. This was achieved 
by setting the shell variable CPU_MAX_COMPUTE_UNITS to 
1 (a more general and complicated way is to use OpenCL device 
fission). 

3.1. SPATIAL NORMALIZATION 

The quality of the normalization to MNI space was tested by 
aligning 198 Tl-weighted volumes to the MNI brain templates (1 
and 2 mm-'' resolution) provided in the FSL software (MNI152_ 
Tl_l mm_brain.nii.gz, MNI152_T1_2 mm_brain.nii.gz). The Tl 
volumes were downloaded from the 1000 functional connec- 
tomes project (Biswal et al., 2010), and the Cambridge dataset 
was selected for its large number of subjects. Each Tl volume is of 
the size 192 x 192 x 144 voxels with a resolution of 1.2 x 1.2 x 
1.2 mm. To fully focus on the registration algorithm, the provided 
skuUstripped Tl volumes were used rather than the original Tl 
volumes. 

For SPM the functions "Normalize" and "Segment" were used 
for normalization. For "Normalize," the parameter 'Source image 
smoothing' was changed from 8 mm to 4 mm, to try to match the 
smoothness of the FSL Tl template (the Tl template in SPM is 
more blurred than the Tl template in FSL). For 'Segment', an ini- 
tial parametric alignment of each Tl volume was first performed 
using the function 'Coregister' (otherwise several normalized Tl 
volumes were far off from the MNI template). Except for these 
modifications, the default settings were used. For FSL, the Tl vol- 
umes were aligned by running FLIRT (which performs linear reg- 
istration) using the skuUstripped volume and template, followed 
by FNIRT (which performs non-linear registration) using the vol- 
ume and template with skull (this is the recommended usage). 
The estimated deformation field was finally applied to the skuU- 
stripped volume. For registration to the 2 mm^ MNI template, the 
configuration file "T1_2_MNI152_2 mm.cnf " was used, whUe the 
default settings were used for registration to the 1 mm' template 
(there is no "T1_2_MNI152_1 mm.cnf"). For AFNI, alignment 
was performed correspondingly by running 3dUnifize (which 
normalizes the image intensity) both for the Tl volume and the 
MNI template, 3dAllineate and 3dQwarp. The estimated displace- 
ment field was finally applied to the original Tl volume without 
intensity normalization, using the function 3dNwarpApply. The 
default interpolation method for 3dNwarp Apply is sine interpola- 
tion, but as SPM, FSL and BROCCOLI all use linear interpolation 
by default, 3dNwarpApply was tested with linear as well as sine 



'https://github.com/wanderine/BROCCOLI/tree/master/code/testing_scripts 



Frontiers in Neuroinformatics 



www.frontiersin.org 



March 2014 | Volume 8 | Article 24 | 10 



Ekiund et al. 



Ultra-fast fMRI analysis using BROCCOLI 



interpolation. The non-linear registration in 3dQwarp is done 
with a combination of cubic and quintic basis functions, and it is 
not possible to change this to linear interpolation. Since 3dQwarp 
in AFNI is a very new method, we used settings proposed in the 
help text. For all software packages, the same settings were used 
for each Tl volume. 

Average normalized Tl volumes were calculated for SPM, FSL, 
AFNI, and BROCCOLI, to visually compare the algorithms, and 
are given in Figure 3. It should be noted that for FSL, the resulting 
displacement field from the 2 mm^ normalization was upscaled 
and used to generate the normalized Tl volumes used here (as 
recommended by the FSL mailing list). For a more numerical 
comparison of the image registration quality, the normalized 
cross-correlation, mutual information and sum of squared dif- 
ferences were calculated between each normalized Tl volume and 
the MNI template, the mean results are given in Figure 4. Only 
the voxels inside the MNI brain mask were used to calculate these 
similarity measures, as 75% of the voxels are outside the brain. 
The processing time for the different software packages are given 
in Figure 5. 

3.2. MOTION CORRECTION 

The motion correction algorithms in SPM (realign), FSL 
(MCFLIRT), AFNI (3dvolreg), and BROCCOLI were tested by 
using test datasets with known motion parameters. The test 
datasets were generated by repeatedly using only the first fMRI 
volume in each dataset and applying known random rigid trans- 
formations to this first volume. The translations and rotations 
were independently generated from a normal distribution with 



a mean of 0 and a standard deviation of 0.5 (voxels for transla- 
tions and degrees for rotations). Gaussian white noise was then 
added to each volume. To further demonstrate the robustness 
of BROCCOLI's phase based algorithm, a shading was added to 
each transformed fMRI volume. An example of the added shad- 
ing is given in Figure 6. The test datasets were created using the 
198 resting state datasets in the Cambridge dataset (Biswal et al., 
2010). Each rest dataset is of the size 72 x 72 x 47 x 119 with a 
voxel resolution of 3 x 3 x 3 mm. 

For SPM and AFNI, the algorithms were tested with lin- 
ear interpolation in addition to the default setting (b-spline for 
SPM and Fourier for AFNI), as FSL and BROCCOLI use linear 
interpolation as default. For SPM and FSL, the reference vol- 
ume was set to the first volume, which is the default for AFNI 
and BROCCOLI. Except for these changes, the default settings 
were used for all software packages. The quality of the motion 
correction was evaluated by comparing the estimated transfor- 
mations to the true ones. For each dataset, the total error was 
calculated as the square root of the sum of the squared dif- 
ferences over all motion parameters p and time points f, i.e.. 



117 U y' 

^ ^ ( motiotiestimatedit, p) - motiorit, 



e(t,p) 



\ t=lp=l 



(27) 

The mean error measures for the different software pack- 
ages, averaged over the 198 subjects, are given in Figure 7 
and the processing times for motion correction are given in 
Figure 8. 




FIGURE 3 I A visual comparison of spatial normalization with the 
different software packages, by averaging 198 normalized T1 volumes. 

(A) MNI template, (B) SPM Normalize average normalized Tl volume, (C) 
SPM Segment average normalized Tl volume, (D) FSL average normalized 
Tl volume, (E) AFNI average normalized Tl volume, (F) BROCCOLI average 
normalized Tl volume. Note that AFNI uses a combination of cubic, quintic, 
and sine interpolation as default, while SPM, FSL, and BROCCOLI all use 
linear interpolation as default. 
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FIGURE 4 I Similarity measures between each normalized T1 volume 
and the MNI template for the different software pacl<ages, averaged 
over 198 subjects. The error bars represent the standard deviation. NCC 
stands for normalized cross correlation (higher is better). Ml stands for 
mutual information (higher is better) and SSD stands for sum of squared 
differences (lower is better). For visualization purposes, the SSD similarity 
measure was divided by 300,000. Only the voxels in the MNI brain mask 
were used to calculate these similarity measures, as 75% of the voxels are 
outside the brain. Different interpolation modes were tested, as the 
software packages have different default settings for interpolation (d 
denotes the default interpolation). 
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FIGURE 5 I Processing times for non-linear spatial normalization of 
one T1 volume of size 192 x 192 x 144 voxels to a IVINI template (1 and 
2 mm^ resolution) for the different software packages, averaged over 
198 T1 volumes. The error bars represent the standard deviation. Note that 
AFNI uses a combination of cubic, quintic, and sine interpolation as default, 
while SPM, FSL, and BROCCOLI all use linear interpolation as default. A 
linear registration was first applied to achieve a good starting point for the 
non-linear registration. BROCCOLI running on a GPU can perform non-linear 
normalization to a 1 mm-' template in 4-6 s, and still provide a satisfactory 
result. BROCCOLI running on a CPU is also significantly faster than FSL and 
AFNI OpenMP even if a single CPU core is used. 




FIGURE 6 I Left: One slice of one fMRI dataset used for testing the motion 
correction algorithms. Right: The same slice after the application of a 
random translation and rotation, and addition of a shading (gradient) 
increasing upwards. The shading will affect all algorithms that use the 
image intensity directly. The phase based algorithm used in BROCCOLI 
will, however, not be affected by this shading. The main reason for this is 
that quadrature filters are bandpass filters, which remove low frequency 
variations (e.g., shadings) as well as high frequency variations (e.g., noise). 



3.3. FIRST LEVEL ANALYSIS 

The first level analysis was tested by analyzing fi-eely available task 
fMRI datasets, downloaded fi-om the OpenfMRI (Poldrack et al, 
2013) homepage. Specifically, the OpenfMRI "rhyme judgment" 
dataset was used where the subjects were presented with pairs of 
either words or pseudowords, and made rhyming judgments for 
each pair. See the work by Xue and Poldrack (2007) for further 
information about this dataset. 

3.3. 1. Frequentist inference 

To the best of our knowledge, the SPM software package does 
not have any default processing pipeline. Instead, we used a batch 
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FIGURE 7 I Motion parameter errors for the different software 
packages, averaged over 198 datasets with artificial motion. The error 
bars represent the standard deviation. The testing datasets were generated 
by applying random translations and rotations to the first fMRI volume in 
each dataset, and then adding Gaussian noise or a shading. The amount of 
noise was defined by setting the standard deviation to a percentage of the 
maximum intensity value. Different interpolation modes were tested, as the 
software packages have different default settings for interpolation (d 
denotes the default interpolation). The presented results were generated 
with an Nvidia GPU, and equal results were also obtained by the Intel CPU 
and the AMD GPU. 



script for first level analysis available on the SPM homepage". 
For FSL, the analysis was setup and started through the graphi- 
cal user interface. For AFNI, the Python script afni_proc.py was 
used, through the graphical interface uber_subject.py. The set- 
tings used for each software are given in Table 2. Processing times 
for first level analysis for the different software packages are given 
in Figure 9. A visual comparison of one brain activity map, for 
BROCCOLI and FSL, is given in Figure 10. Processing times for 
BROCCOLI for a first level permutation-based analysis, using 
10,000 permutations, are given in Figure 11. 

3.3.2. Bayesian inference 

The Bayesian fMRI analysis was tested by generating a total of 
1 1,000 draws from the posterior distribution for each brain voxel 
(44,220 voxels), and the first 1000 draws were discarded as "burn 
in" samples. The PPM was calculated as the percentage of draws 
where the GLM weight of interest was larger than zero. The result- 
ing PPM is given in Figure 12, and can be compared to the t-map 
in Figure 10. The processing time was 4706 s using the Intel CPU 
and one core, 835 s using the Intel CPU and all the four cores, 
190 s for the Nvidia GPU and 91 s for the AMD GPU. This can be 
compared to about 20 h for a naive Matlab implementation. 

3.4. SECOND LEVEL ANALYSIS 

To test the second level analysis, the permutation functional- 
ity in BROCCOLI was compared to the function randomize 

' ^ http://www.fiI.ion.ucLac.uk/ spm/ data/face_rep/ face_rep_spm5_batch.m 
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in FSL (SPM and AFNI do not have any support for permu- 
tation based analysis, although AFNI for example has support 
for Kruskal-Wallis tests and WUcoxon tests). The function ran- 
domize_paraUel in FSL automatically divides the number of 
permutations to a number of computers or CPU cores (if for 
example Condor or GridEngine is installed), and was therefore 
also used for testing. First level results generated by FSL (down- 
loaded from the OpenfMRI homepage) were used as inputs to 
the second level analysis, to fully focus on the permutation pro- 
cedure. Here we used the OpenfMRI dataset "word and object 
processing", as it has the largest number of subjects (49). See the 
work by Duncan et al. (2009) for further information about this 
dataset. Processing times for FSL and BROCCOLI for a second 
level permutation-based analysis of the 49 subjects, using 10,000 
permutations, are given in Figure 13. Null distributions gener- 
ated by FSL and BROCCOLI, for a design matruc containing a 
single regressor, were compared numerically and were found to 



be equivalent. A direct comparison for more than one regres- 
sor is more problematic, as the randomize function in FSL first 
transforms the design matrix to effective regressors and effec- 
tive confound regressors, by using information from the contrast 
vector. 

4. DISCUSSION 

We have presented a new software package for fMRI analysis. 
BROCCOLI is written in OpenCL, making it possible to run the 
analysis in parallel, taking full advantage of a large variety of 
hardware configurations. To exemplify this, BROCCOLI has been 
tested with an Intel CPU, an Nvidia GPU and an AMD GPU. 
The main objective of BROCCOLI is to demonstrate the advan- 
tages of parallel processing and to enable the neuroimaging field 
to avail itself of more computationally demanding normalization 
algorithms, and statistical methods that are based on a smaller 
number of assumptions (e.g., by using non-parametric statistics). 
Currently, BROCCOLI reduces the fMRI processing time by at 
least an order of magnitude compared to existing software pack- 
ages (even if only a CPU and not a GPU is used). For non-linear 
spatial normalization, BROCCOLI running on an Nvidia GPU is 
approximately 525 times faster compared to FSL and AFNI, and 
195 times faster than AFNI OpenMP. For second level permu- 
tation tests, BROCCOLI using an Nvidia GPU is 100-200 times 
faster than FSL and 33-130 times faster than the parallel version 
of FSL. 

4.1. SPATIAL NORMALIZATION 

The accuracy measures illustrated in Figures 3 and 4 reveal a 
number of interesting differences. The normalization in AFNI 
yields the highest mean correlation and mutual information. It 
might seem non-intuitive that the sine interpolation in AFNI 
gives a higher sum of squared differences compared to the lin- 
ear interpolation, but this is possibly explained by the fact that 
the sine interpolation preserves high resolution details, perhaps 
beyond the meaningful resolution of the MNI template. The aver- 
age normalized Tl volumes generated by SPM are clearly the most 
blurred, although the algorithms are fast compared to FSL and 
AFNI. The results presented here are consistent with a previous 
comparison (Klein et al., 2009), where the FSL function FNIRT 
was shown to provide better normalizations than the SPM func- 
tions "Segment" and "Normalize." AFNI was not included in this 
comparison, as the function 3dQwarp was released recently. 

These comparisons should not be considered as a thor- 
ough head-to-head evaluation of the different software packages. 
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Processing time for motion correction of one dataset 
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FIGURE 8 I Processing times for motion correction of one fIVIRI dataset 
of size 72 X 72 X 47 X 119 for the different software packages, 
averaged over 198 datasets. Tlie error bars represent tlie standard 
deviation. All algoritlims registered all volumes to the first one. Tlie 
processing times for AFNI and AFNI OpenMP are the same, as the AFNI 
software does not have any OpenMP support for motion correction. 
Different interpolation modes were tested, as the software packages have 
different default settings for interpolation (d denotes the default 
interpolation). 



Table 2 | Settings for first level analysis for the different software packages (for AFNI it is currently not possible to select non-linear 
registration in the graphical user interface). 



Normalization 



Motion 



Motion 
regressors 



Smoothing 
(mm) 



Cluster 
simulation 



Modeling of 
GLM residuals 



SPM Linear -h non-linear to MNI template Yes Yes, 6 

FSL Linear -h non-linear to MNI template Yes Yes, 6 

AFNI Linear to MNI template Yes Yes, 6 

BROCCOLI Linear -F non-linear to MNI template Yes Yes, 6 



Not available 
Not available 

No 

Not available 



Global AR(1) 
FILM prewhitening 
(Woolrich etal., 2001) 
Voxel-wise ARMAd, 1) 
Voxel-wise AR(4) 
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Processing time for first level analysis of 13 subjects, using non-linear normalization 
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FIGURE 9 I Processing times for first level analysis of 13 fMRI datasets 
(of size 64 X 64 X 33 X 160). The analysis includes non-linear 
normalization to a brain template, slice timing correction, motion correction, 
smoothing, and statistical analysis. A Matlab script, available on the SPM 
homepage, was used for SPM. For FSL, the analysis was setup and started 
through the graphical user interface. For AFNI, the analysis was performed 
with afni_proc.pY, through the graphical user interface uber_subiect.py. It 
should be noted that SPM, FSL and BROCCOLI use linear and non-linear 
registration, while AFNI uses linear registration only (currently, it is not 
possible to select non-linear registration in uber_subject.py). To compensate 
for this, the non-linear registration for AFNI was done separately. Note that 
it is not possible to select a 2 mm-^ brain template in uber_subject.py, these 
processing times are therefore not defined. Also note that the processing 
times for BROCCOLI do not include any first level permutation test. 




FIGURE 10 I Brain activity maps (representing t-values) from first level 
analysis of one OpenflVIRI dataset, for BROCCOLI and FSL. Subjects 
were presented with pairs of either words or pseudowords in a block based 
design, and made rhyming judgments for each pair. The first level analysis 
here includes motion correction, segmentation of the fMRI data, 
smoothing, and statistical analysis. Both BROCCOLI and FSL used motion 
regressors in the statistical analysis. As BROCCOLI and FSL use different 
models of the GLM residuals, we here present activity maps with and 
without whitening. The activity maps have been arbitrarly thresholded at a 
t-value of 5. 



Rather, the motivation was to show that BROCCOLI can provide 
a satisfactory normaUzation to MNI space in a short amount of 
time. An aspect not considered here, for example, is the smooth- 
ness of the resulting displacement fields. It is also possible that the 
different algorithms would perform better if the default settings 
were changed. 

4.2. MOTION CORRECTION 

The evaluation of the motion correction algorithms shows that 
BROCCOLI yields the smallest difference between the true 
motion parameters and the estimated ones, closely followed by 
AFNI. BROCCOLI using a GPU and AFNI perform the motion 
correction in a similar amount of time, while SPM and FSL 
are significantly slower. For BROCCOLI running on a CPU, the 
processing time is rather long, which is mainly explained by 
the fact that three (non-separable) quadrature filters need to be 
applied for each time point and for each iteration (3-5 itera- 
tions of the linear registration algorithm is normally sufficient 
for motion correction). BROCCOLI also estimates 12 affine reg- 
istration parameters for each time point, and then restricts them 
to a rigid transformation (6 parameters). The results presented 
here are consistent with a previous comparison of motion correc- 
tion algorithms (Oakes et al., 2005), where the AFNI software was 
shown to provide the most accurate motion estimates. 

It should be noted that the test used here is not based on real- 
istic head motion, as completely random transformations were 



applied for each time point. This can, for example, negatively 
effect the MCFLIRT function used in FSL. The reason for this is 
that MCFLIRT uses the motion estimate from the previous time 
point as a starting estimate for the next time point (Jenkinson 
et al, 2002). Similarly, 3dvolreg in AFNI is only intended for small 
motions, and the transformations applied here may have been too 
severe. The shading test is also not very realistic, but clearly shows 
the robustness of phase based registration algorithms compared 
to intensity based algorithms. For these reasons, the presented 
results should be interpreted with caution. 

4.3. FIRST LEVEL ANALYSIS 
4.3.1. Frequentist inference 

The first level analysis using FSL and BROCCOLI yield very sim- 
ilar results, both with and without pre-whitening to correct for 
auto correlation in the GLM residuals. The small differences in 
activation between FSL and BROCCOLI can be explained by a 
number of factors. The motion correction algorithms, for exam- 
ple, provide slightly different results according to Figure 7 and 
this wiU affect further processing. There are also some differences 
in how FSL and BROCCOLI setup the design matrix and treat 
the auto correlation of the GLM residuals. BROCCOLI uses four 
detrending regressors (mean, linear trend, quadratic trend, cubic 
trend) while FSL instead applies a temporal filtering to the data 
and the regressors. BROCCOLI smooths all the AR estimates in 
the same way, while FSL separately smooths AR estimates in white 
and gray brain matter (Woolrich et al., 2001). 
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FIGURE 11 I Processing times for BROCCOLI for first level analysis using 
a permutation based t-test with 10,000 permutations (SPM, FSL, and 
AFNI do not provide any functions for first level permutation based 
analysis). Left: Voxel-level inference, the maximum f-test value is saved in 
each permutation. Right: Cluster-level inference, the extent of the largest 
cluster is saved in each permutation. A t-value of 3 was used as a cluster 
defining threshold. The data used is of the size 64 x 64 x 33 x 160. A brain 



mask was used to only perform the statistical calculations for the brain 
voxels. Note that these processing times do not include smoothing in each 
permutation. Smoothing the fMRI data 10,000 times takes about 8970s 
using one core on the Intel CPU, 2710 s using all the four cores on the Intel 
CPU, 335s with the Nvidia GPU and 550s with the AMD GPU. Also note that 
ordinary least squares is used to estimate the GLM beta weights in each 
permutation, and not the more demanding Cochrane-Orcutt procedure. 




FIGURE 12 I A posterior probability map (PPM) from a Bayesian first 
level analysis of one OpenflVIRI dataset. Subjects were presented with 
pairs of either words or pseudowords in a block based design, and made 
rhyming judgments for each pair. The first level analysis here includes 
motion correction, segmentation of the fMRI data, smoothing, and 
statistical analysis. The PPM represents the probability of the first GLM 
beta weight being larger than zero, and has been arbitrarly thresholded at a 
probability of 0.99. Note that the PPM has been calculated by using a Gibbs 
sampler, and not by using techniques based on variational Bayes. Also note 
that the frequentist approach uses a voxel-wise AR(4) model of the GLM 
residuals, while the Bayesian currently uses a voxel-wise AR(1) model (due 
to hardware limitations). 



BROCCOLI is significantly faster than SPM, FSL, and AFNL 
even when the analysis is run on a CPU. SPM is also faster 
than FSL and AFNL which is mainly explained by a faster spa- 
tial normalization. The parallel version of FSL, where one first 
level analysis in our case runs on each CPU thread, is significantly 
faster than the non-parallel version. However, as the first level 



analysis in FSL requires more than 2 GB of memory, we were only 
able to run 6 (instead of 8) threads in parallel (since the computer 
used for testing has 16 GB of memory). 

4.3.2. Bayesian inference 

The Bayesian first level analysis yields results that are similar to 
the f-maps, although the results cannot be compared directly. It 
might seem confusing that the AMD GPU is faster than the Nvidia 
GPU, especially since the Nvidia GPU is faster for permutation 
tests. The reason for this is that the random number generation 
currently uses double precision, and the AMD GPU used in our 
case has better support for such calculations than the Nvidia GPU 
(see Table 1). 

4.4. SECOND LEVEL ANALYSIS 

The processing times in Figure 13 for the second level permuta- 
tion test may at first appear confusing. The speedup of using ran- 
domize_parallel instead of randomize decreases with the number 
of regressors, from a speedup of 3.2 for a single regressor to 1.6 
for 25 regressors (but the actual time saved increases). The 10,000 
permutations are divided into smaller work items of 300 per- 
mutations each for randomize_parallel. However, 33 work items 
cannot be divided equally to a CPU running 8 threads (8 threads * 
4 work items per thread = 32 work items). The permutation test 
is therefore not completed until the last work item has been pro- 
cessed, for which only a single CPU thread is active. The unequal 
division is more problematic for more regressors, as each work 
item then takes a longer time to process. 

The processing time for BROCCOLI is not affected as much 
by the number of GLM regressors as the FSL software is, result- 
ing in a larger speedup for a larger number of regressors. A 
GPU thread that performs a small number of calculations is 
very limited by the memory bandwidth. More regressors lead 
to more calculations and, thereby, a better utilization of the 
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FIGURE 13 I Processing times for second level analysis using a 
permutation based t-test with 10,000 permutations, for BROCCOLI 
and FSL (the SPM and AFNI software packages do not provide any 
functions for permutation based analysis). Left: Voxel-level inference, 
the maximum f-test value is saved in each permutation. Right: 
Cluster-level inference, the extent of the largest cluster is saved in each 
permutation. A t-value of 3 was used as a cluster defining threshold. The 
data used are beta volumes from 49 subjects, where each beta volume is 
of the size 91 x 109 x 91 voxels. A brain mask was used to only perform 
the statistical calculations for the brain voxels. The processing time for FSL 
increases quickly with the number of regressors, while the processing 



time for BROCCOLI increases much more slowly. This is explained by the 
fact that calculations on a GPU are efficient, once all the data have been 
loaded from the slow global memory to the fast thread specific registers. 
To estimate several beta weights per voxel, instead of a single weight, 
therefore results in a better utilization of the GPU performance. The 
processing time for BROCCOLI using an AMD GPU is 2-5 times as high 
compared to BROCCOLI using a Nvidia GPU. One possible explanation for 
this is that the code was converted from CUDA to OpenCL. Note that 
these processing times are for data normalized to a 2mm'^ MNI template. 
The permutation tests would take approximately eight times longer for 
data normalized to a 1 mm'' MNI template. 



computational capabilities of a GPU. BROCCOLI running on 
a CPU is also faster than the parallel version of FSL. FSL 
divides the work into several CPU cores by using a pack- 
age like Condor or GridEngine. Such an approach cannot 
as easily take advantage of vectorized operations [e.g., Intel 
streaming SIMD extensions (SSE)], where the same opera- 
tion is applied to a number of elements simultaneously. Note 
that this is a distinct, second layer of parallel processing. In 
addition to the code running on several CPU cores instead 
of just one, the processing on each individual core is vector- 
ized, performing 4-16 arithmetic operations on different data at 
once. 

It should also be noted that the presented processing 
times are for fMRI data registered to a 2 mm^ MNI tem- 
plate, each permutation test would take approximately 8 
times longer for data registered to a 1 mm^ MNI template. 
Threshold free cluster enhancement (Smith and Nichols, 
2009) is another inference method that would benefit 
from GPU acceleration, as it is much more computation- 
ally demanding compared to voxel-level and cluster-level 
inference. 

4.5. LIMITATIONS 

The following list itemizes the current limitations of using 
BROCCOLI: 

• BROCCOLI currently has very limited support for image 
segmentation, but such algorithms are often easy to run in 
parallel (Eldund et al, 2013a). 

• The quality of the fMRI-to-Tl registration has not been tested 
as extensively as the Tl-to-MNI registration. There are, at least. 



two reasons why the fMRI-to-Tl registration is harder to test 
than the Tl-to-MNI registration. First, the fMRI data is of 
much lower spatial resolution and an average of 198 registered 
fMRI volumes would therefore be extremely blurry. Second, 
the fMRI data is often distorted due to artifacts from the MRI 
sequence. 

• The SPM, FSL, and AFNI software packages have been used for 
a long time and have been extensively tested, while BROCCOLI 
is completely new software. 

• SPM, FSL, and AFNI all provide a graphical user interface, 
which BROCCOLI currently does not. 

• SPM, FSL, and AFNI aU provide a large number of func- 
tions which can be combined to basically solve any prob- 
lem. BROCCOLI is on the other hand currently limited 
to image registration and first and second level fMRI 
analyses. 

• SPM, FSL, and AFNI all provide some sort of community 
forum where users can get help. 

4.6. FUTURE WORK 

In the future, BROCCOLI can be improved and extended in sev- 
eral ways. The most important addition may be a graphical user 
interface, so that as many researchers as possible can take advan- 
tage of parallel processing. For the first version of BROCCOLI 
we have focused on functionality and stability, and not so much 
on the computational performance. As most of the code was 
converted from CUDA to OpenCL, it is likely that BROCCOLI 
performs best for Nvidia GPUs. Optimizing the code for other 
hardware platforms (e.g., Intel and AMD) will therefore be one 
important project (Enmyren and Kessler, 2010). For permuta- 
tion tests involving large datasets, multi-GPU support can be used 
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to further reduce the computational burden, by running a num- 
ber of permutations on each GPU (Ekiund et al, 2011a). First 
level analysis can also run in parallel on several GPUs with multi- 
GPU support, such that each GPU independently processes one 
subject. Another natural extension would be to provide several 
other wrappers for BROCCOLI, such as R and bash. 

Rather than using ordinary least squares to estimate beta 
weights in the GLM, it would be interesting to, for example, use 
a regularized regression approach such as LASSO (Tibshirani, 
1996) instead. LASSO is often used together with cross valida- 
tion, and would be rather time consuming to run for every voxel. 
This is especially true if LASSO is combined with a permuta- 
tion procedure, to correct for multiple comparisons. Most fMRI 
researchers use the GLM for the statistical analysis, but multi- 
variate approaches that adaptively combine timeseries of several 
voxels can, in some cases, yield higher statistical power. We would 
therefore also like to convert our existing CUDA code for canoni- 
cal correlation analysis (CCA) (Friman et al., 2003; Ekiund et al., 
2011a) to OpenCL and include it in BROCCOLI. The null dis- 
tribution of canonical correlations is much more complicated 
than conventional f-tests, a problem which can be solved with 
permutation-based procedures. 
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